Session 5: Scaling, Reporting and Database Software

Introduction to Web Scraping and Data Management for Social Scientists

Johannes B. Gruber

2025-07-11

Introduction

This Course

Day Session
1 Introduction
2 Data Structures and Wrangling
3 Working with Files
4 Linking and joining data & SQL
5 Scaling, Reporting and Database Software
6 Introduction to the Web
7 Static Web Pages
8 Application Programming Interface (APIs)
9 Interactive Web Pages
10 Building a Reproducible Research Project

The Plan for Today

In this session, you will learn about:

  • Recap: DBMS
  • Working with PostgreSQL
  • Working with text databases
  • Benchmarking
  • Final scaling tips

Nik via unsplash.com

Databases

DBMS: servers and clients

  • most DBMS are set up in client-server architecture:
    • server: can be a computer somewhere or a process on your own computer that fulfills requests
    • DBMS server: contains the database and database management system
    • client: interacts with server (sends requests, receives responses)
    • DBMS client: can upload and retrieve data from server or send processing instructions
  • Why this architecture?
    • server can run on more powerful hardware somewhere else
    • requests from multiple users don’t interfere with each other
    • access control per user to support different roles

Interacting with a database management system from Weidmann (2023), p. 105.

Got Server?

https://azure.microsoft.com/en-us/free/

RDBMS software

  • many popular alternatives:
    • Oracle Database
    • MySQL
    • Microsoft SQL Server
    • IBM Db2
    • Microsoft Access
    • SQLite (free software)
    • MariaDB (free software)
  • all use slightly different dialects of SQL, but the core functionality is the same
  • We will use PostgreSQL: free and open source, well-known, and many other programming languages and tools can interface to it
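To give a flavour of these dialect differences, here is the same string concatenation written in three dialects (the table and column names are made up for illustration):

```sql
-- PostgreSQL and SQLite use the || operator
SELECT first_name || ' ' || last_name AS full_name FROM people;

-- MySQL/MariaDB use the CONCAT() function
SELECT CONCAT(first_name, ' ', last_name) AS full_name FROM people;

-- Microsoft SQL Server traditionally uses +
SELECT first_name + ' ' + last_name AS full_name FROM people;
```

The core clauses (SELECT, FROM, WHERE, JOIN, GROUP BY) behave the same everywhere; it is mostly functions and operators like these that differ.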

Working with PostgreSQL

Installation

We are using Docker to spin up a local server that has PostgreSQL already installed.

You have to type this into your Terminal(!):

docker-compose -f 05_Scaling_Reporting_and_Database_Software/data/docker-compose.yml up -d

This runs the compose file below:

services:
  postgres:
    image: postgres
    container_name: postgres_db
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: pgpasswd
      POSTGRES_DB: dbintro
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:

Connecting from R

library(DBI)
library(RPostgres)
db <- dbConnect(
  Postgres(),
  dbname = "dbintro",
  host = "localhost",  # Make sure to specify the host
  port = 5432L,
  user = "postgres",
  password = "pgpasswd"
)

Let’s fill this database with the nycflights13 flights data:

library(nycflights13)
dbWriteTable(db, "airports", airports, overwrite = TRUE)
dbWriteTable(db, "flights", flights, overwrite = TRUE)
dbWriteTable(db, "weather", weather, overwrite = TRUE)
dbWriteTable(db, "planes", planes, overwrite = TRUE)
dbWriteTable(db, "airlines", airlines, overwrite = TRUE)

Connecting from the terminal

docker exec -it -u postgres postgres_db psql -d dbintro

Inside the Docker container (postgres_db is the name of the container we started above), you can list tables with:

\dt

You can run SQL operations in here without R:

SELECT * FROM airports;
#                     ^

Note that commands are only executed when psql encounters a ;!

  • Exit long output (the pager) by pressing q
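Besides \dt, psql has a few other meta-commands that are useful here (the annotations on the right are descriptions, not part of the command):

```
\l           list all databases on the server
\dt          list tables in the current database
\d flights   describe the columns of the flights table
\du          list users and roles
\q           quit psql
```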

Working with PostgreSQL: pretty similar to SQLite…

But not the same…

dbExecute(db,
          "CREATE TABLE df1 (
              id SERIAL PRIMARY KEY,
              capital_letters VARCHAR(1) CHECK (capital_letters ~ '^[A-Z]$'), 
              my_date DATE CHECK (my_date > '2000-01-01')
          )")
[1] 0

Working with PostgreSQL and dbplyr: the same as SQLite

I copied this from the last session, and it works just as well:

tbl(db, "flights") |> 
  inner_join(tbl(db, "planes"), by = "tailnum", suffix = c("", "_plane")) |> 
  mutate(plane_age = year - year_plane) |> 
  select(arr_delay, plane_age) |> 
  filter(!is.na(arr_delay),
         !is.na(plane_age)) |> 
  collect() |> 
  group_by(plane_age) |> 
  summarise(avg_delay = mean(arr_delay)) |> 
  ggplot(aes(x = plane_age, y = avg_delay)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

User-based authentication

Imagine this:

  • you have a team of researchers
  • some are responsible for data gathering (annotation, webscraping, conducting survey waves, etc.)
  • some are responsible for data analysis
  • you are responsible for data management and public reporting
  • You want to make sure that:
    • researchers in the gathering team do not overwrite each other's changes
    • the analysis team always has the newest data
    • the summary data on the website is online as soon as possible
  • with PostgreSQL you can make sure the two groups don't disrupt each other and the most recent results are pulled from the database automatically

User-based authentication: users

We create three new users:

  • one for a researcher in the gathering group
  • one for an analyser in the analysis group
  • one called “reader”, which represents the general public
dbExecute(db, "CREATE USER gatherer WITH ENCRYPTED PASSWORD 'supersecret'")
[1] 0
dbExecute(db, "CREATE USER analyser WITH ENCRYPTED PASSWORD 'supersecret'")
[1] 0
dbExecute(db, "CREATE USER reader WITH ENCRYPTED PASSWORD 'supersecret'")
[1] 0

We can look at users:

dbGetQuery(db, "SELECT usename FROM pg_user")
   usename
1 postgres
2 gatherer
3 analyser
4   reader

User-based authentication: users

Let’s log in as analyser:

db_analyser <- dbConnect(
  Postgres(),
  dbname = "dbintro",
  host = "localhost",  # Make sure to specify the host
  port = 5432L,
  user = "analyser",
  password = "supersecret"
)

User-based authentication: roles

So far, neither of the new users can do anything:

tbl(db_analyser, "flights")
Error in `db_query_fields.DBIConnection()`:
! Can't query fields.
ℹ Using SQL: SELECT * FROM "flights" AS "q01" WHERE (0 = 1)
Caused by error:
! Failed to fetch row : ERROR:  permission denied for table flights

They first have to be granted the corresponding privileges.

User-based authentication: roles

We first give the “gatherer” user permission to update a table:

dbExecute(db, "GRANT UPDATE,INSERT ON flights TO gatherer")
[1] 0

Since we do not fully trust the analyser, we give them read-only access to flights, but full access to df1, the table we pretend the analysis results are stored in:

dbExecute(db, "GRANT SELECT ON flights TO analyser")
[1] 0
dbExecute(db, "GRANT ALL PRIVILEGES ON df1 TO analyser")
[1] 0

Finally, the public gets SELECT access to just one column of the results table:

dbExecute(db, "GRANT SELECT (capital_letters) ON df1 TO reader")
[1] 0
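The counterpart of GRANT is REVOKE, in case access needs to be withdrawn again; a sketch mirroring the statements above:

```r
# withdraw the gatherer's write access to flights again
dbExecute(db, "REVOKE UPDATE, INSERT ON flights FROM gatherer")
```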

User-based authentication: roles (analyser)

Still logged in as the analyser, let’s try to access the data again:

tbl(db_analyser, "flights")
# Source:   table<"flights"> [?? x 19]
# Database: postgres  [analyser@localhost:5432/dbintro]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

But we still can’t make changes:

dbExecute(db_analyser,
          "INSERT INTO flights (year)
            VALUES (2013)")
Error: Failed to fetch row : ERROR:  permission denied for table flights

We can add data to the tables that we were granted access to though:

dbExecute(db_analyser,
          "INSERT INTO df1
            VALUES (1, 'A', '2025-07-11')")
[1] 1

User-based authentication: roles (gatherer)

Logging in as the gatherer, we can add new cases:

db_gatherer <- dbConnect(
  Postgres(),
  dbname = "dbintro",
  host = "localhost",  # Make sure to specify the host
  port = 5432L,
  user = "gatherer",
  password = "supersecret"
)
dbExecute(db_gatherer,
          "INSERT INTO flights (year)
            VALUES (2013)")
[1] 1

User-based authentication: roles (reader)

Logging in as the reader, we can’t change anything and can only read specific columns:

db_reader <- dbConnect(
  Postgres(),
  dbname = "dbintro",
  host = "localhost",  # Make sure to specify the host
  port = 5432L,
  user = "reader",
  password = "supersecret"
)
dbGetQuery(db_reader, "SELECT * FROM df1")
Error: Failed to fetch row : ERROR:  permission denied for table df1
dbGetQuery(db_reader, "SELECT capital_letters FROM df1")
  capital_letters
1               A

Now we could hand this user out to the public without worrying that they change or read anything they are not supposed to.

Exercises 1

Using the PostgreSQL database or the SQLite database from session 4:

  1. Right join results_state and facts using dbplyr instead of dbGetQuery
  2. Recreate the table results_state_time by querying and joining from db (using dbplyr instead of dbGetQuery).
  3. Recreate results_state_facts using dbplyr. Don’t forget to add total_votes and pct_votes
  4. Extract the SQL query from your code in 3. and run it with dbGetQuery

Working with text data in AmCAT

Why AmCAT

  • Optimized to store, annotate, preprocess, search, share and present text data collections
  • For teams or individual researchers
  • Fine-grained access control
  • Free and Open Source

Installation

Same as with PostgreSQL, we can use Docker.

  1. Download the compose file:
curl::curl_download(
  url = "https://raw.githubusercontent.com/ccs-amsterdam/amcat4docker/main/docker-compose.yml", 
  destfile = "data/docker-compose-amcat.yml"
)
  2. Spin it up via the Terminal
docker-compose -f 05_Scaling_Reporting_and_Database_Software/data/docker-compose-amcat.yml up -d

Connecting from the terminal

Creating a test index:

docker exec -it amcat4 amcat4 create-test-index

Configure the AmCAT server:

docker exec -it amcat4 amcat4 config
#> Reading/writing settings from .env
#> 
#> host: Host this instance is served at (needed for checking tokens)
#> The current value for host is http://localhost/amcat.
#> Enter a new value, press [enter] to leave unchanged, or press [control+c] to abort: 
#> 
#> elastic_password: Elasticsearch password. This is the password for the 'elastic' user when Elastic xpack security is enabled
#> The current value for elastic_password is None.
#> Enter a new value, press [enter] to leave unchanged, or press [control+c] to abort: 
#> 
#> elastic_host: Elasticsearch host. Default: https://localhost:9200 if elastic_password is set, http://localhost:9200 otherwise
#> The current value for elastic_host is http://elastic8:9200.
#> Enter a new value, press [enter] to leave unchanged, or press [control+c] to abort: 
#> 
#> elastic_verify_ssl: Elasticsearch verify SSL (only used if elastic_password is set). Default: True unless host is localhost)
#> The current value for elastic_verify_ssl is True.
#> Enter a new value, press [enter] to leave unchanged, or press [control+c] to abort: 
#> 
#> system_index: Elasticsearch index to store authorization information in
#> The current value for system_index is amcat4_system.
#> Enter a new value, press [enter] to leave unchanged, or press [control+c] to abort: 
#> 
#> auth: Do we require authorization?
#>   Possible choices:
#>   - no_auth: everyone (that can reach the server) can do anything they want
#>   - allow_guests: everyone can use the server, dependent on index-level guest_role authorization settings
#>   - allow_authenticated_guests: everyone can use the server, if they have a valid middlecat login, and dependent on index-level guest_role authorization settings
#>   - authorized_users_only: only people with a valid middlecat login and an explicit server role can use the server
#> 
#> The current value for auth is AuthOptions.no_auth.
#> Enter a new value, press [enter] to leave unchanged, or press [control+c] to abort: 
#> 
#> middlecat_url: Middlecat server to trust as ID provider
#> The current value for middlecat_url is https://middlecat.net.
#> Enter a new value, press [enter] to leave unchanged, or press [control+c] to abort: 
#> 
#> admin_email: Email address for a hardcoded admin email (useful for setup and recovery)
#> The current value for admin_email is None.
#> Enter a new value, press [enter] to leave unchanged, or press [control+c] to abort: 
#> 
#> minio_host: None
#> The current value for minio_host is None.
#> Enter a new value, press [enter] to leave unchanged, or press [control+c] to abort: 
#> 
#> minio_tls: None
#> The current value for minio_tls is False.
#> Enter a new value, press [enter] to leave unchanged, or press [control+c] to abort: 
#> 
#> minio_access_key: None
#> The current value for minio_access_key is None.
#> Enter a new value, press [enter] to leave unchanged, or press [control+c] to abort: 
#> 
#> minio_secret_key: None
#> The current value for minio_secret_key is None.
#> Enter a new value, press [enter] to leave unchanged, or press [control+c] to abort: 
#> *** Written .env file to .env ***

Connecting from R

We need to log in, which by default you can do without creating a user:

# remotes::install_github("ccs-amsterdam/amcat4r")
library(amcat4r)
amcat_login("http://localhost/amcat")

We can have a look at the example corpus with:

query_documents(index = "state_of_the_union")
# A tibble: 200 × 3
   .id      title                   date               
   <id_col> <chr>                   <dttm>             
 1 9Vs…TGr  1790: George Washington 1790-01-08 00:00:00
 2 9ls…TGr  1790: George Washington 1790-12-08 00:00:00
 3 91s…TGr  1791: George Washington 1791-10-25 00:00:00
 4 -Fs…TGr  1792: George Washington 1792-11-06 00:00:00
 5 -Vs…TGr  1793: George Washington 1793-12-03 00:00:00
 6 -ls…TGr  1794: George Washington 1794-11-19 00:00:00
 7 -1s…TGr  1795: George Washington 1795-12-08 00:00:00
 8 _Fs…TGr  1796: George Washington 1796-12-07 00:00:00
 9 _Vs…TGr  1797: John Adams        1797-11-22 00:00:00
10 _ls…TGr  1798: John Adams        1798-12-08 00:00:00
# ℹ 190 more rows

Dataset

Load data from Rauh and Schwalbach (2020) into AmCAT:

if (!"houseofcommons" %in% list_indexes()$id) {
  corp_hoc_df <- readRDS("../03_Working_with_Files/data/Corp_HouseOfCommons_V2.rds") |> 
    mutate(date = as.Date(date)) |> 
    rename(title = agenda) |> 
    filter(!is.na(date)) |> 
    replace_na(list(title = "", text = "")) |> 
    select(-party.facts.id)
  
  # speed this up by only using a sample
  # corp_hoc_df <- corp_hoc_df |> 
  #   sample_n(size = 500)
  
  # define types of fields
  fields = list(
    date = "date",
    text = "text",
    title = "text",        
    speechnumber = "integer",   
    speaker = "keyword",       
    party = "keyword",
    chair = "boolean",      
    terms = "integer",          
    parliament = "keyword", 
    iso3country = "keyword"
  )
  # create the index
  create_index(index = "houseofcommons", 
               name = "House Of Commons", 
               description = "HouseOfCommons", 
               create_fields = fields)
  
  # upload the data
  upload_documents("houseofcommons", documents = corp_hoc_df, chunk_size = 1000, verbose = TRUE) 
}

AmCAT GUI

AmCAT API

  • R and Python packages
  • OpenAPI specifications
  • Search, upload, download, modify data (e.g., add keywords/categories)
  • User and access management
  • Easy way to make research reproducible by querying data for analysis
  • Great for larger than memory data sets

AmCAT API

query_documents(index = "houseofcommons", queries = "*Europ*")
# A tibble: 200 × 3
   .id      date                title                                           
   <id_col> <dttm>              <chr>                                           
 1 TT0…mBx  1988-12-01 00:00:00 Engagements [Oral Answers To Questions > Prime …
 2 fD0…mBy  1988-12-01 00:00:00 Patrick Ryan                                    
 3 jz0…mBy  1988-12-01 00:00:00 Patrick Ryan                                    
 4 pD0…mBy  1988-12-01 00:00:00 Business Of The House                           
 5 tj0…mBy  1988-12-01 00:00:00 Business Of The House                           
 6 0j0…mBy  1988-12-01 00:00:00 Business Of The House                           
 7 _z0…mBz  1988-12-01 00:00:00 Fisheries                                       
 8 BD0…mFz  1988-12-01 00:00:00 Fisheries                                       
 9 Bz0…mFz  1988-12-01 00:00:00 Fisheries                                       
10 Cz0…mFz  1988-12-01 00:00:00 Fisheries                                       
# ℹ 190 more rows
query_documents(index = "houseofcommons", queries = '"European Union" OR EU')
# A tibble: 200 × 3
   .id      date                title                                           
   <id_col> <dttm>              <chr>                                           
 1 Oj0…XJr  1988-12-19 00:00:00 Adjournment (Christmas)                         
 2 8z0…JZJ  1989-02-07 00:00:00 Nato Secretary-General [Oral Answers To Questio…
 3 4T0…VqA  1988-11-25 00:00:00 Foreign Affairs And Defence [Orders Of The Day …
 4 _j0…VqA  1988-11-25 00:00:00 Foreign Affairs And Defence [Orders Of The Day …
 5 Aj0…VuB  1988-11-25 00:00:00 Foreign Affairs And Defence [Orders Of The Day …
 6 Kz0…VuB  1988-11-25 00:00:00 Debate On The Address [Orders Of The Day > Orde…
 7 Fj0…1pw  1988-11-24 00:00:00 Business Of The House [Prayers > Bills Presente…
 8 0T0…MJ1  1989-03-21 00:00:00 Advisory Committee For Wales [Orders Of The Day…
 9 ez0…fMs  1989-05-18 00:00:00 Developments In The European Community          
10 hT0…fMs  1989-05-18 00:00:00 Developments In The European Community          
# ℹ 190 more rows
query_aggregate(index = "houseofcommons",
                filters = list(party = c("Con", "Lab"),
                               date = list(gte = "2000-01-01")),
                axes = list(list(field="party"), list(field="date", interval="year")),
                queries = '"European Union" OR EU')
# A tibble: 40 × 3
   party date_year      n
   <chr> <chr>      <int>
 1 Lab   2000-01-01  1340
 2 Lab   2001-01-01  1214
 3 Lab   2002-01-01  1242
 4 Lab   2003-01-01  1733
 5 Lab   2004-01-01  1507
 6 Lab   2005-01-01  1392
 7 Lab   2006-01-01  1160
 8 Lab   2007-01-01  1308
 9 Lab   2008-01-01  2407
10 Lab   2009-01-01  1051
# ℹ 30 more rows

Access Control in Amcat4 enables Non-Consumptive Research

  • Sharing data helps scientific progress!
  • Copyright and/or (privacy-)sensitive material cannot be easily shared
  • Definition: non-consumptive research involves the use of computational methods to analyze data without giving access to the data itself

Exercises 2

  1. Connect to your own AmCAT server (alternatively, connect to mine)
  2. Create a new index and add some toy data to it that has: a date, a title, and a text field
  3. Query all documents from your index
  4. Query the documents from the “houseofcommons” index that have:
    1. mentions of protest, protesters, protester, or protesting
    2. no mentions of protestant, in addition to 4.1
    3. contain speeches by the Conservative Party, in addition to 4.1 and 4.2 (hint: use a filter)
  5. Count (hint: aggregate) how often the Conservatives mention protest words each month after 2000

Scaling your research

Benchmarking

  • when your data grows, it becomes important that your code is fast, since it is applied to many cases
  • benchmarking helps you identify bottlenecks
  • whether your code is optimized for speed can make the difference between a run taking minutes or days

Benchmarking with bench

Basic syntax:

  1. define functions that wrap your code
library(bench)
fun1 <- function() {
  1 + 1
}
fun2 <- function() {
  a <- 1
  b <- 1
  sum(c(a, b))
}
  2. run mark()
res <- mark(simple = fun1(), complex = fun2())
  3. check summary
summary(res)
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 simple         12ns   19.1ns 44373783.        0B        0
2 complex      12.9ns     21ns 39553728.        0B        0
summary(res, relative = TRUE)
# A tibble: 2 × 6
  expression   min median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
1 simple      1      1         1.12       NaN      NaN
2 complex     1.08   1.10      1          NaN      NaN

Benchmarking: in memory vs in database

  1. define functions that wrap your code
in_memory <- function() {
  nycflights13::flights |> 
    inner_join(nycflights13::weather, by = c("time_hour", "origin")) |> 
    filter(time_hour < "2013-03-01")
}

in_db <- function() {
  tbl(db, "flights") |> 
    inner_join(tbl(db, "weather"), by = c("time_hour", "origin")) |> 
    filter(time_hour < "2013-03-01") |> 
    collect()
}
  2. run mark()
res <- mark(in_memory = in_memory(), in_db = in_db(), check = FALSE, iterations = 15)
  3. check summary
summary(res)
# A tibble: 2 × 6
  expression      min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 in_memory      15ns   22.9ns 17523326.        0B        0
2 in_db          15ns   21.1ns 20154703.        0B        0
summary(res, relative = TRUE)
# A tibble: 2 × 6
  expression   min median `itr/sec` mem_alloc `gc/sec`
  <bch:expr> <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
1 in_memory      1   1.09      1          NaN      NaN
2 in_db          1   1         1.15       NaN      NaN

Fix bottlenecks

Without an index, the search in the database takes longer. You can also have indexes created automatically when copying data with copy_to().

dbWriteTable(db, "flights_w_index", flights, overwrite = TRUE)
dbWriteTable(db, "weather_w_index", weather, overwrite = TRUE)
dbExecute(db, "CREATE INDEX ON flights_w_index (time_hour)")
[1] 0
dbExecute(db, "CREATE INDEX ON weather_w_index (time_hour)")
[1] 0
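Instead of creating the indexes by hand, dplyr::copy_to() can add them while copying the data; a sketch (the table name flights_copy is made up):

```r
library(dplyr)
# copy the data frame into the database and index time_hour in one step
copy_to(db, nycflights13::flights, "flights_copy",
        temporary = FALSE, overwrite = TRUE,
        indexes = list("time_hour"))
```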
  1. define functions that wrap your code
in_memory <- function() {
  nycflights13::flights |> 
    inner_join(nycflights13::weather, by = c("time_hour", "origin")) |> 
    filter(time_hour < "2013-03-01")
}

in_db <- function() {
  tbl(db, "flights") |> 
    inner_join(tbl(db, "weather"), by = c("time_hour", "origin")) |> 
    filter(time_hour < "2013-03-01") |> 
    collect()
}

in_db_w_index <- function() {
  tbl(db, "flights_w_index") |> 
    inner_join(tbl(db, "weather_w_index"), by = c("time_hour", "origin")) |> 
    filter(time_hour < "2013-03-01") |> 
    collect()
}
  2. run mark()
res <- mark(
  in_memory = in_memory(), 
  in_db = in_db(), 
  in_db_w_index = in_db_w_index(), 
  check = FALSE, 
  iterations = 25
)
  3. check summary
summary(res)
# A tibble: 3 × 6
  expression         min   median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
1 in_memory         18ns     25ns 11169685.        0B        0
2 in_db             19ns     27ns 13425129.        0B        0
3 in_db_w_index   22.1ns   33.1ns 17416737.        0B        0
summary(res, relative = TRUE)
# A tibble: 3 × 6
  expression      min median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>    <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
1 in_memory      1      1         1          NaN      NaN
2 in_db          1.05   1.08      1.20       NaN      NaN
3 in_db_w_index  1.23   1.32      1.56       NaN      NaN

Data Management: Summary

What for?

  • enables new research questions by being able to combine and clean new data
  • lets you sleep easy at night
  • makes your work transparent to others and yourself
  • thinking about reproducibility from the start

When to use files vs. databases

  • often not clear from the start
  • working with files is easier for a solo researcher (who keeps a clean project directory)
  • when data changes regularly it makes sense to not add to the same file again and again
  • when several researchers collaborate, a database solves many issues
  • when you have text data, vectors, or complex elements, NoSQL databases like AmCAT or MongoDB can make sense

When done, dbDisconnect

Whenever you are done working with a database, you should disconnect from it:

dbDisconnect(db)
dbDisconnect(db_analyser)
dbDisconnect(db_gatherer)
dbDisconnect(db_reader)
  • This closes the connection, discards all pending work, and frees resources
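In functions and scripts, a robust pattern is to register the disconnect right after connecting, so it also happens when an error occurs; a sketch using on.exit() (the function name is made up, connection details as above):

```r
count_flights <- function() {
  con <- dbConnect(
    Postgres(),
    dbname = "dbintro", host = "localhost",
    port = 5432L, user = "postgres", password = "pgpasswd"
  )
  on.exit(dbDisconnect(con))  # runs even if the query below fails
  dbGetQuery(con, "SELECT count(*) FROM flights")
}
```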

Reminder: Social Programme

DATE Event Time Venue
MONDAY 7 July Meet and Greet - in person 19:00 start SU Bar
TUESDAY 8 July Climbing 18:30 start Sports Centre
WEDNESDAY 9 July Harold Clarke Speaker Series - hybrid 18:45 - 20.00 EBS
THURSDAY 10 July Sports Night 18:30 - 20:30 Sports Centre
FRIDAY 11 July Wivenhoe Pub Run 18:30 start Wivenhoe pubs
MONDAY 14 JULY SU bar Quiz 19:00 start SU Bar
TUESDAY 15 JULY Sports Night 18:30 - 20:30 Sports Centre
WEDNESDAY 16 JULY Harold Clarke Speaker Series - hybrid 18:30 EBS
THURSDAY 17 JULY Farewell Party Karaoke 20:30 - 23:30 SU Bar

Wrap Up

Save some information about the session for reproducibility.

Show Session Info
sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: x86_64-pc-linux-gnu
Running under: EndeavourOS

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.12.0 
LAPACK: /usr/lib/liblapack.so.3.12.0  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Berlin
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bench_1.1.4         amcat4r_4.0.15.9000 nycflights13_1.0.2 
 [4] RPostgres_1.4.8     DBI_1.2.3           lubridate_1.9.4    
 [7] forcats_1.0.0       stringr_1.5.1       dplyr_1.1.4        
[10] purrr_1.0.4         readr_2.1.5         tidyr_1.3.1        
[13] tibble_3.3.0        ggplot2_3.5.1       tidyverse_2.0.0    
[16] tinytable_0.8.0    

loaded via a namespace (and not attached):
 [1] utf8_1.2.6        rappdirs_0.3.3    generics_0.1.3    lattice_0.22-7   
 [5] stringi_1.8.7     hms_1.1.3         digest_0.6.37     magrittr_2.0.3   
 [9] evaluate_1.0.3    grid_4.5.1        timechange_0.3.0  fastmap_1.2.0    
[13] blob_1.2.4        Matrix_1.7-3      jsonlite_2.0.0    mgcv_1.9-3       
[17] fansi_1.0.6       scales_1.3.0      httr2_1.1.2       cli_3.6.5        
[21] rlang_1.1.6       dbplyr_2.5.0      litedown_0.7      bit64_4.6.0-1    
[25] munsell_0.5.1     splines_4.5.1     withr_3.0.2       yaml_2.3.10      
[29] tools_4.5.1       tzdb_0.5.0        colorspace_2.1-1  profmem_0.7.0    
[33] curl_6.3.0        vctrs_0.6.5       R6_2.6.1          lifecycle_1.0.4  
[37] bit_4.6.0         pkgconfig_2.0.3   pillar_1.10.2     gtable_0.3.6     
[41] glue_1.8.0        xfun_0.52         tidyselect_1.2.1  rstudioapi_0.17.1
[45] knitr_1.50        farver_2.1.2      nlme_3.1-168      htmltools_0.5.8.1
[49] labeling_0.4.3    rmarkdown_2.29    compiler_4.5.1    askpass_1.2.1    
[53] openssl_2.3.3    

References

Rauh, Christian, and Jan Schwalbach. 2020. “The ParlSpeech V2 data set: Full-text corpora of 6.3 million parliamentary speeches in the key legislative chambers of nine representative democracies.” Harvard Dataverse. https://doi.org/10.7910/DVN/L4OAKN.
Weidmann, Nils B. 2023. Data Management for Social Scientists: From Files to Databases. 1st ed. Cambridge University Press. https://doi.org/10.1017/9781108990424.